Linear Models for Classification

Roadmap
1 When Can Machines Learn?
2 Why Can Machines Learn?
3 How Can Machines Learn?

Lecture 10: Logistic Regression
gradient descent on cross-entropy error to get good logistic hypothesis

Lecture 11: Linear Models for Classification
• Linear Models for Binary Classification
• Stochastic Gradient Descent
• Multiclass via Logistic Regression
• Multiclass via Binary Classification

4 How Can Machines Learn Better?
Linear Models for Binary Classification

Linear Models Revisited

linear scoring function: $s = \mathbf{w}^T \mathbf{x}$

linear classification
  $h(\mathbf{x}) = \mathrm{sign}(s)$
  plausible err = 0/1
  discrete $E_{in}(\mathbf{w})$: NP-hard to solve

linear regression
  $h(\mathbf{x}) = s$
  friendly err = squared
  quadratic convex $E_{in}(\mathbf{w})$: closed-form solution

logistic regression
  $h(\mathbf{x}) = \theta(s)$
  plausible err = cross-entropy
  smooth convex $E_{in}(\mathbf{w})$: gradient descent

can linear regression or logistic regression help linear classification?
Error Functions Revisited

linear scoring function: $s = \mathbf{w}^T \mathbf{x}$

for binary classification $y \in \{-1, +1\}$

linear classification
  $h(\mathbf{x}) = \mathrm{sign}(s)$
  $\mathrm{err}(h, \mathbf{x}, y) = [\![h(\mathbf{x}) \ne y]\!]$
  $\mathrm{err}_{0/1}(s, y) = [\![\mathrm{sign}(s) \ne y]\!] = [\![\mathrm{sign}(ys) \ne 1]\!]$

linear regression
  $h(\mathbf{x}) = s$
  $\mathrm{err}(h, \mathbf{x}, y) = (h(\mathbf{x}) - y)^2$
  $\mathrm{err}_{\mathrm{SQR}}(s, y) = (s - y)^2 = (ys - 1)^2$

logistic regression
  $h(\mathbf{x}) = \theta(s)$
  $\mathrm{err}(h, \mathbf{x}, y) = -\ln h(y\mathbf{x})$
  $\mathrm{err}_{\mathrm{CE}}(s, y) = \ln(1 + \exp(-ys))$

$(ys)$: classification correctness score
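The rewriting of the squared and cross-entropy errors in terms of $ys$ can be checked in two lines. This is a short worked derivation, assuming only $y \in \{-1, +1\}$ (so $y^2 = 1$) and the logistic function $\theta(s) = 1/(1 + \exp(-s))$ from Lecture 10:

```latex
\begin{align*}
\mathrm{err}_{\mathrm{SQR}}(s, y) &= (s - y)^2 = y^2 (s - y)^2 = (ys - y^2)^2 = (ys - 1)^2, \\
\mathrm{err}_{\mathrm{CE}}(s, y) &= -\ln \theta(ys) = -\ln \frac{1}{1 + \exp(-ys)} = \ln\bigl(1 + \exp(-ys)\bigr).
\end{align*}
```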


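Visualizing Error Functions

0/1: $\mathrm{err}_{0/1}(s, y) = [\![\mathrm{sign}(ys) \ne 1]\!]$
sqr: $\mathrm{err}_{\mathrm{SQR}}(s, y) = (ys - 1)^2$
ce: $\mathrm{err}_{\mathrm{CE}}(s, y) = \ln(1 + \exp(-ys))$
scaled ce: $\mathrm{err}_{\mathrm{SCE}}(s, y) = \log_2(1 + \exp(-ys))$

[figure: the four error functions plotted against $ys$]

• 0/1: 1 iff $ys \le 0$
• sqr: large if $ys \ll 1$, but over-charges $ys \gg 1$
  small $\mathrm{err}_{\mathrm{SQR}}$ $\Rightarrow$ small $\mathrm{err}_{0/1}$
• ce: monotonic in $ys$
  small $\mathrm{err}_{\mathrm{CE}}$ $\Leftrightarrow$ small $\mathrm{err}_{0/1}$
• scaled ce: a proper upper bound of 0/1
  small $\mathrm{err}_{\mathrm{SCE}}$ $\Leftrightarrow$ small $\mathrm{err}_{0/1}$

upper bound: useful for designing algorithmic error $\widehat{\mathrm{err}}$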
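The comparison among the four error functions can also be checked numerically. The following is a minimal sketch, not part of the lecture: the function names and the grid of $ys$ values are illustrative assumptions, and each error is written as a function of the correctness score $ys$ only.

```python
import math

def err_01(ys):
    # 0/1 error: 1 iff sign(ys) != 1, i.e. ys <= 0
    return 1.0 if ys <= 0 else 0.0

def err_sqr(ys):
    # squared error written in terms of ys (valid because y is +1 or -1)
    return (ys - 1.0) ** 2

def err_ce(ys):
    # cross-entropy error with natural logarithm
    return math.log(1.0 + math.exp(-ys))

def err_sce(ys):
    # scaled cross-entropy: base-2 logarithm, i.e. err_ce / ln 2
    return err_ce(ys) / math.log(2.0)

# assumed grid: ys from -3 to 3 in steps of 0.1
grid = [k / 10.0 for k in range(-30, 31)]
sqr_is_upper_bound = all(err_sqr(v) >= err_01(v) for v in grid)
sce_is_upper_bound = all(err_sce(v) >= err_01(v) for v in grid)
ce_is_upper_bound = all(err_ce(v) >= err_01(v) for v in grid)

print(sqr_is_upper_bound, sce_is_upper_bound, ce_is_upper_bound)
```

On this grid the squared and scaled cross-entropy errors stay above the 0/1 error, while the unscaled cross-entropy does not: at $ys = 0$ it equals $\ln 2 < 1$, which is why the scaling by $1/\ln 2$ is introduced.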
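Theoretical Implication of Upper Bound

For any $ys$ where $s = \mathbf{w}^T \mathbf{x}$:

$\mathrm{err}_{0/1}(s, y) \le \mathrm{err}_{\mathrm{SCE}}(s, y) = \frac{1}{\ln 2} \mathrm{err}_{\mathrm{CE}}(s, y)$

$\Longrightarrow$
$E_{in}^{0/1}(\mathbf{w}) \le E_{in}^{\mathrm{SCE}}(\mathbf{w}) = \frac{1}{\ln 2} E_{in}^{\mathrm{CE}}(\mathbf{w})$
$E_{out}^{0/1}(\mathbf{w}) \le E_{out}^{\mathrm{SCE}}(\mathbf{w}) = \frac{1}{\ln 2} E_{out}^{\mathrm{CE}}(\mathbf{w})$

VC on 0/1:
$E_{out}^{0/1}(\mathbf{w}) \le E_{in}^{0/1}(\mathbf{w}) + \Omega^{0/1} \le \frac{1}{\ln 2} E_{in}^{\mathrm{CE}}(\mathbf{w}) + \Omega^{0/1}$

VC-Reg on CE:
$E_{out}^{0/1}(\mathbf{w}) \le \frac{1}{\ln 2} E_{out}^{\mathrm{CE}}(\mathbf{w}) \le \frac{1}{\ln 2} E_{in}^{\mathrm{CE}}(\mathbf{w}) + \frac{1}{\ln 2} \Omega^{\mathrm{CE}}$

small $E_{in}^{\mathrm{CE}}(\mathbf{w})$ $\Longrightarrow$ small $E_{out}^{0/1}(\mathbf{w})$:
logistic/linear reg. for linear classification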






Regression for Classification

1 run logistic/linear reg. on $\mathcal{D}$ with $y_n \in \{-1, +1\}$ to get $\mathbf{w}_{\mathrm{REG}}$
2 return $g(\mathbf{x}) = \mathrm{sign}(\mathbf{w}_{\mathrm{REG}}^T \mathbf{x})$

PLA
• pros: efficient + strong guarantee if lin. separable
• cons: works only if lin. separable, otherwise needing pocket heuristic

linear regression
• pros: 'easiest' optimization
• cons: loose bound of $\mathrm{err}_{0/1}$ for large $|ys|$

logistic regression
• pros: 'easy' optimization
• cons: loose bound of $\mathrm{err}_{0/1}$ for very negative $ys$

• linear regression sometimes used to set $\mathbf{w}_0$ for PLA/pocket/logistic regression
• logistic regression often preferred over pocket
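The two-step recipe (run a regression algorithm on $\pm 1$ labels, then take the sign of the score) can be sketched as below. This is a minimal illustration, assuming batch gradient descent on the cross-entropy error with a fixed step size and step count chosen by hand; the toy data and the helper names are made up for the example and do not come from the lecture.

```python
import math

def theta(s):
    # logistic function, written to avoid overflow for large |s|
    if s >= 0:
        return 1.0 / (1.0 + math.exp(-s))
    e = math.exp(s)
    return e / (1.0 + e)

def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

def logistic_regression(data, eta=0.5, steps=1000):
    """Batch gradient descent on the cross-entropy E_in (assumed settings)."""
    d = len(data[0][0])
    n = len(data)
    w = [0.0] * d
    for _ in range(steps):
        # v = -grad E_in(w) = (1/N) sum_n theta(-y_n w^T x_n) (y_n x_n)
        v = [0.0] * d
        for x, y in data:
            c = theta(-y * dot(w, x))
            for i in range(d):
                v[i] += c * y * x[i] / n
        w = [wi + eta * vi for wi, vi in zip(w, v)]
    return w

def g(w, x):
    # step 2: classify with the sign of the regression score
    return 1 if dot(w, x) > 0 else -1

# made-up, linearly separable toy data; x = (1, x1, x2) with x0 = 1
data = [((1.0, 2.0, 1.0), +1), ((1.0, 1.5, 2.0), +1), ((1.0, 3.0, 0.5), +1),
        ((1.0, -1.0, -0.5), -1), ((1.0, -2.0, 1.0), -1), ((1.0, -0.5, -2.0), -1)]

w_reg = logistic_regression(data)
train_errors = sum(g(w_reg, x) != y for x, y in data)
print(w_reg, train_errors)
```

Swapping `logistic_regression` for a linear regression solver leaves step 2 unchanged, which is the point of the slide: only the way $\mathbf{w}_{\mathrm{REG}}$ is obtained differs.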
Fun Time

Following the definition in the lecture, which of the following is not always $\ge \mathrm{err}_{0/1}(y, s)$ when $y \in \{-1, +1\}$?

1 $\mathrm{err}_{0/1}(y, s)$
2 $\mathrm{err}_{\mathrm{SQR}}(y, s)$
3 $\mathrm{err}_{\mathrm{CE}}(y, s)$
4 $\mathrm{err}_{\mathrm{SCE}}(y, s)$

Reference Answer: 3

Too simple, uh? :-) Anyway, note that $\mathrm{err}_{0/1}$ is surely an upper bound of itself. For the unscaled cross-entropy, $ys = 0$ gives $\mathrm{err}_{\mathrm{CE}} = \ln 2 < 1 = \mathrm{err}_{0/1}$.
Stochastic Gradient Descent

Two Iterative Optimization Schemes

For $t = 0, 1, \ldots$
  $\mathbf{w}_{t+1} \leftarrow \mathbf{w}_t + \eta \mathbf{v}$
when stop, return last $\mathbf{w}$ as $g$

logistic regression
check $\mathcal{D}$ and decide $\mathbf{w}_{t+1}$ (or new $\mathbf{w}$) by all examples
O(N) time per iteration :-(

PLA
pick $(\mathbf{x}_n, y_n)$ and decide $\mathbf{w}_{t+1}$ by the one example
O(1) time per iteration :-)

logistic regression with O(1) time per iteration?
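Logistic Regression Revisited

$\mathbf{w}_{t+1} \leftarrow \mathbf{w}_t + \eta \underbrace{\frac{1}{N} \sum_{n=1}^{N} \theta\left(-y_n \mathbf{w}_t^T \mathbf{x}_n\right) \left(y_n \mathbf{x}_n\right)}_{-\nabla E_{in}(\mathbf{w}_t)}$

• want: update direction $\mathbf{v} \approx -\nabla E_{in}(\mathbf{w}_t)$ while computing $\mathbf{v}$ by one single $(\mathbf{x}_n, y_n)$
• technique on removing $\frac{1}{N} \sum_{n=1}^{N}$: view as expectation $\mathcal{E}$ over uniform choice of $n$!

stochastic gradient: $\nabla_{\mathbf{w}} \mathrm{err}(\mathbf{w}, \mathbf{x}_n, y_n)$ with random $n$
true gradient: $\nabla_{\mathbf{w}} E_{in}(\mathbf{w}) = \mathcal{E}_{\text{random } n} \nabla_{\mathbf{w}} \mathrm{err}(\mathbf{w}, \mathbf{x}_n, y_n)$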
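The claim that the true gradient is the expectation of the stochastic gradient over a uniform choice of $n$ can be checked numerically. The sketch below is illustrative only: the toy data, the weight vector, and the helper names are assumptions. It averages the per-example gradients exactly (each with probability $1/N$) and compares the result with a finite-difference gradient of $E_{in}$.

```python
import math

def theta(s):
    return 1.0 / (1.0 + math.exp(-s))

def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

def grad_err(w, x, y):
    """Stochastic gradient: gradient of ln(1 + exp(-y w^T x)) for one example."""
    c = theta(-y * dot(w, x))
    return [-c * y * xi for xi in x]

def E_in(w, data):
    """Average cross-entropy error over the data set."""
    return sum(math.log(1.0 + math.exp(-y * dot(w, x))) for x, y in data) / len(data)

def numeric_grad(w, data, h=1e-6):
    """Central finite differences of E_in, used only as an independent check."""
    g = []
    for i in range(len(w)):
        wp = list(w)
        wm = list(w)
        wp[i] += h
        wm[i] -= h
        g.append((E_in(wp, data) - E_in(wm, data)) / (2.0 * h))
    return g

# made-up data and weights for the check
data = [((1.0, 0.5, -1.0), +1), ((1.0, -2.0, 0.3), -1),
        ((1.0, 1.5, 2.0), +1), ((1.0, 0.1, -0.7), -1)]
w = [0.2, -0.4, 0.1]

# expectation over uniform n: sum_n (1/N) * grad_err(w, x_n, y_n)
expected_stochastic = [0.0] * len(w)
for x, y in data:
    for i, gi in enumerate(grad_err(w, x, y)):
        expected_stochastic[i] += gi / len(data)

true_gradient = numeric_grad(w, data)
max_gap = max(abs(a - b) for a, b in zip(expected_stochastic, true_gradient))
print(max_gap)
```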





Stochastic Gradient Descent (SGD)

stochastic gradient = true gradient + zero-mean 'noise' directions

Stochastic Gradient Descent
• idea: replace true gradient by stochastic gradient
• after enough steps, average true gradient $\approx$ average stochastic gradient
• pros: simple & cheaper computation :-)
  useful for big data or online learning
• cons: less stable in nature

SGD logistic regression, looks familiar? :-)

$\mathbf{w}_{t+1} \leftarrow \mathbf{w}_t + \eta \underbrace{\theta\left(-y_n \mathbf{w}_t^T \mathbf{x}_n\right) \left(y_n \mathbf{x}_n\right)}_{-\nabla \mathrm{err}(\mathbf{w}_t, \mathbf{x}_n, y_n)}$
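The SGD update for logistic regression fits in a few lines. This is a minimal sketch under assumed settings (fixed $\eta = 0.1$, a fixed number of steps, a seeded random generator, and made-up toy data); it is meant to show the shape of the loop rather than a tuned implementation.

```python
import math
import random

def theta(s):
    # logistic function, written to avoid overflow for large |s|
    if s >= 0:
        return 1.0 / (1.0 + math.exp(-s))
    e = math.exp(s)
    return e / (1.0 + e)

def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

def sgd_logistic_regression(data, eta=0.1, steps=5000, seed=0):
    """Follow the negative stochastic gradient, one random example per step."""
    rng = random.Random(seed)
    w = [0.0] * len(data[0][0])
    for _ in range(steps):
        x, y = rng.choice(data)            # uniform choice of n
        c = theta(-y * dot(w, x))          # how 'wrong' the current w is on (x, y)
        w = [wi + eta * c * y * xi for wi, xi in zip(w, x)]
    return w

# made-up, linearly separable toy data; x = (1, x1, x2) with x0 = 1
data = [((1.0, 2.0, 1.0), +1), ((1.0, 1.5, 2.0), +1), ((1.0, 3.0, 0.5), +1),
        ((1.0, -1.0, -0.5), -1), ((1.0, -2.0, 1.0), -1), ((1.0, -0.5, -2.0), -1)]

w_sgd = sgd_logistic_regression(data)
train_errors = sum((1 if dot(w_sgd, x) > 0 else -1) != y for x, y in data)
print(w_sgd, train_errors)
```

Each iteration touches one example, so the cost per step is independent of $N$, at the price of a noisier path than batch gradient descent.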
PLA Revisited

SGD logistic regression:
$\mathbf{w}_{t+1} \leftarrow \mathbf{w}_t + \eta \cdot \theta\left(-y_n \mathbf{w}_t^T \mathbf{x}_n\right) \left(y_n \mathbf{x}_n\right)$

PLA:
$\mathbf{w}_{t+1} \leftarrow \mathbf{w}_t + 1 \cdot [\![y_n \ne \mathrm{sign}(\mathbf{w}_t^T \mathbf{x}_n)]\!] \left(y_n \mathbf{x}_n\right)$

• SGD logistic regression $\approx$ 'soft' PLA
• PLA $\approx$ SGD logistic regression with $\eta = 1$ when $\mathbf{w}_t^T \mathbf{x}_n$ large

two practical rules of thumb:
• stopping condition? $t$ large enough
• $\eta$? 0.1 when $\mathbf{x}$ in proper range
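The 'soft PLA' reading can be made concrete by comparing the two multipliers in front of $y_n \mathbf{x}_n$: the logistic weight $\theta(-y_n s)$ and the PLA indicator $[\![y_n \ne \mathrm{sign}(s)]\!]$, where $s = \mathbf{w}_t^T \mathbf{x}_n$. The sketch below uses a few hand-picked scores (an assumption for illustration) to show that the two agree closely when $|s|$ is large and differ when $|s|$ is small.

```python
import math

def theta(s):
    # logistic function, written to avoid overflow for large |s|
    if s >= 0:
        return 1.0 / (1.0 + math.exp(-s))
    e = math.exp(s)
    return e / (1.0 + e)

def sgd_lr_weight(y, s):
    # multiplier of (y x) in the SGD logistic regression update
    return theta(-y * s)

def pla_weight(y, s):
    # multiplier of (y x) in the PLA update: 1 on a mistake, 0 otherwise
    sign = 1 if s > 0 else -1
    return 1.0 if y != sign else 0.0

scores = [-8.0, -0.5, 0.5, 8.0]   # hand-picked values of s = w^T x
rows = [(y, s, sgd_lr_weight(y, s), pla_weight(y, s))
        for y in (-1, +1) for s in scores]

gap_large = max(abs(soft - hard) for y, s, soft, hard in rows if abs(s) >= 8.0)
gap_small = max(abs(soft - hard) for y, s, soft, hard in rows if abs(s) < 1.0)

for row in rows:
    print(row)
```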
Fun Time

Consider applying SGD on linear regression for big data. What is the update direction when using the negative stochastic gradient?

1 $\mathbf{x}_n$
2 $y_n \mathbf{x}_n$
3 $2(\mathbf{w}_t^T \mathbf{x}_n - y_n) \mathbf{x}_n$
4 $2(y_n - \mathbf{w}_t^T \mathbf{x}_n) \mathbf{x}_n$

Reference Answer: 4

Go check lecture 9 if you have forgotten about the gradient of squared error. :-) Anyway, the update rule has a nice physical interpretation: improve $\mathbf{w}_t$ by 'correcting' proportional to the residual $(y_n - \mathbf{w}_t^T \mathbf{x}_n)$.
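The residual-correcting update from the reference answer can be tried on a small noiseless problem. This is a sketch under stated assumptions: the target $y = 1 + 2 x_1$, the step size, the step count, and the seed are all chosen for illustration only.

```python
import random

def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

def sgd_linear_regression(data, eta=0.05, steps=5000, seed=0):
    """SGD on squared error: w <- w + eta * 2 * (y_n - w^T x_n) * x_n."""
    rng = random.Random(seed)
    w = [0.0] * len(data[0][0])
    for _ in range(steps):
        x, y = rng.choice(data)
        residual = y - dot(w, x)          # how far the prediction is from y
        w = [wi + eta * 2.0 * residual * xi for wi, xi in zip(w, x)]
    return w

# noiseless toy data from an assumed target y = 1 + 2 * x1; x = (1, x1)
data = [((1.0, x1), 1.0 + 2.0 * x1) for x1 in (-1.0, -0.5, 0.0, 0.5, 1.0)]

w = sgd_linear_regression(data)
print(w)
```

Because the data are noiseless, every residual shrinks toward zero and the weights settle near the assumed target $(1, 2)$; with noisy data the iterates would keep jittering around the least-squares solution instead.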
Multiclass via Logistic Regression

Multiclass Classification

• $\mathcal{Y}$ = {□, ◇, △, ☆} (4-class classification)
• many applications in practice, especially for 'recognition'

next: use tools for {×, ○} classification to {□, ◇, △, ☆} classification









One Class at a Time

[figure: the 4-class data relabeled for each binary problem, with the chosen class as ○ and all other classes as ×]

□ or not? {□ = ○, ◇ = ×, △ = ×, ☆ = ×}
◇ or not? {□ = ×, ◇ = ○, △ = ×, ☆ = ×}
△ or not? {□ = ×, ◇ = ×, △ = ○, ☆ = ×}
☆ or not? {□ = ×, ◇ = ×, △ = ×, ☆ = ○}
Multiclass Prediction: Combine Binary Classifiers

[figure: the four one-class-at-a-time binary classifiers combined on the 4-class data]

but ties? :-)
One Class at a Time Softly

[figure: soft (probability) estimates for each binary problem on the 4-class data]

P(□|x)? {□ = ○, ◇ = ×, △ = ×, ☆ = ×}
P(◇|x)? {□ = ×, ◇ = ○, △ = ×, ☆ = ×}
P(△|x)? {□ = ×, ◇ = ×, △ = ○, ☆ = ×}
P(☆|x)? {□ = ×, ◇ = ×, △ = ×, ☆ = ○}











Multiclass Prediction: Combine Soft Classifiers

$g(\mathbf{x}) = \mathrm{argmax}_{k \in \mathcal{Y}} \, \theta\left(\mathbf{w}_{[k]}^T \mathbf{x}\right)$
One-Versus-All (OVA) Decomposition

1 for $k \in \mathcal{Y}$
  obtain $\mathbf{w}_{[k]}$ by running logistic regression on
  $\mathcal{D}_{[k]} = \{(\mathbf{x}_n, y'_n = 2 [\![y_n = k]\!] - 1)\}_{n=1}^{N}$
2 return $g(\mathbf{x}) = \mathrm{argmax}_{k \in \mathcal{Y}} \left(\mathbf{w}_{[k]}^T \mathbf{x}\right)$

• pros: efficient, can be coupled with any logistic regression-like approaches
• cons: often unbalanced $\mathcal{D}_{[k]}$ when $K$ large
• extension: multinomial ('coupled') logistic regression

OVA: a simple multiclass meta-algorithm to keep in your toolbox
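The two OVA steps can be sketched as follows. This is a minimal illustration, not the lecture's code: the binary learner is batch gradient-descent logistic regression with hand-picked settings, and the three-class toy data, class names, and helper names are assumptions.

```python
import math

def theta(s):
    # logistic function, written to avoid overflow for large |s|
    if s >= 0:
        return 1.0 / (1.0 + math.exp(-s))
    e = math.exp(s)
    return e / (1.0 + e)

def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

def logistic_regression(data, eta=0.1, steps=3000):
    """Batch gradient descent on cross-entropy error (assumed settings)."""
    d, n = len(data[0][0]), len(data)
    w = [0.0] * d
    for _ in range(steps):
        v = [0.0] * d
        for x, y in data:
            c = theta(-y * dot(w, x))
            for i in range(d):
                v[i] += c * y * x[i] / n
        w = [wi + eta * vi for wi, vi in zip(w, v)]
    return w

def ova_train(data, classes):
    """Step 1: one logistic regression per class k on D_[k]."""
    w = {}
    for k in classes:
        d_k = [(x, 1 if y == k else -1) for x, y in data]   # y' = 2[y = k] - 1
        w[k] = logistic_regression(d_k)
    return w

def ova_predict(w, x):
    """Step 2: return the class with the largest score w_[k]^T x."""
    return max(w, key=lambda k: dot(w[k], x))

# made-up 3-class toy data; x = (1, x1, x2) with x0 = 1
data = [((1.0, 0.0, 3.0), "square"), ((1.0, 0.5, 2.5), "square"),
        ((1.0, -3.0, -2.0), "diamond"), ((1.0, -2.5, -2.5), "diamond"),
        ((1.0, 3.0, -2.0), "triangle"), ((1.0, 2.5, -2.5), "triangle")]
classes = ["square", "diamond", "triangle"]

w_ova = ova_train(data, classes)
predictions = [ova_predict(w_ova, x) for x, _ in data]
print(predictions)
```

Taking the argmax of the raw scores gives the same answer as the argmax of $\theta(\cdot)$ because $\theta$ is monotonic, and it sidesteps the tie problem of combining hard binary decisions.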
Fun Time

Which of the following best describes the training effort of OVA decomposition based on logistic regression on some $K$-class classification data of size $N$?

1 learn $K$ logistic regression hypotheses, each from data of size $N/K$
2 learn $K$ logistic regression hypotheses, each from data of size $N \ln K$
3 learn $K$ logistic regression hypotheses, each from data of size $N$
4 learn $K$ logistic regression hypotheses, each from data of size $NK$

Reference Answer: 3

Note that the learning part can be easily done in parallel, while the data is essentially of the same size as the original data.
Multiclass via Binary Classification

Source of Unbalance: One versus All

idea: make binary classification problems more balanced by one versus one
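One versus One at a Time

[figure: the 4-class data restricted to one pair of classes at a time, one panel per pair]

□ or ◇? {□ = ○, ◇ = ×, △ = nil, ☆ = nil}
(one such binary problem for each of the six pairs of classes; examples of the other classes are left out)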





Hsuan-Tien Lin (NTU CSIE)


Multiclass Prediction: Combine Pairwise Classifiers

[figure: the six pairwise classifiers combined on the 4-class data]

$g(\mathbf{x})$ = tournament champion $\left\{\mathbf{w}_{[k,\ell]}^T \mathbf{x}\right\}$
(voting of classifiers)
One-versus-one (OVO) Decomposition

1 for $(k, \ell) \in \mathcal{Y} \times \mathcal{Y}$
  obtain $\mathbf{w}_{[k,\ell]}$ by running linear binary classification on
  $\mathcal{D}_{[k,\ell]} = \{(\mathbf{x}_n, y'_n = 2 [\![y_n = k]\!] - 1) : y_n = k \text{ or } y_n = \ell\}$
2 return $g(\mathbf{x})$ = tournament champion $\left\{\mathbf{w}_{[k,\ell]}^T \mathbf{x}\right\}$

• pros: efficient ('smaller' training problems), stable, can be coupled with any binary classification approaches
• cons: use $O(K^2)$ $\mathbf{w}_{[k,\ell]}$: more space, slower prediction, more training

OVO: another simple multiclass meta-algorithm to keep in your toolbox
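The OVO steps can be sketched in the same style. This is an illustrative sketch under assumptions: PLA stands in for the 'linear binary classification' algorithm (any binary learner could be plugged in), each unordered pair of classes is trained once, and the toy data and helper names are made up for the example.

```python
from collections import Counter
from itertools import combinations

def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

def pla(data, max_epochs=1000):
    """Perceptron learning algorithm; stops once a full pass makes no mistake."""
    w = [0.0] * len(data[0][0])
    for _ in range(max_epochs):
        mistakes = 0
        for x, y in data:
            if y * dot(w, x) <= 0:
                w = [wi + y * xi for wi, xi in zip(w, x)]
                mistakes += 1
        if mistakes == 0:
            break
    return w

def ovo_train(data, classes):
    """Step 1: one binary classifier per pair (k, l), trained on classes k and l only."""
    w = {}
    for k, l in combinations(classes, 2):
        d_kl = [(x, 1 if y == k else -1) for x, y in data if y in (k, l)]
        w[(k, l)] = pla(d_kl)
    return w

def ovo_predict(w, x):
    """Step 2: every pairwise classifier votes; the class with most votes wins.
    Ties are broken by first-seen order, which is an arbitrary choice here."""
    votes = Counter(k if dot(w_kl, x) > 0 else l for (k, l), w_kl in w.items())
    return votes.most_common(1)[0][0]

# made-up 3-class toy data; x = (1, x1, x2) with x0 = 1
data = [((1.0, 0.0, 3.0), "square"), ((1.0, 0.5, 2.5), "square"),
        ((1.0, -3.0, -2.0), "diamond"), ((1.0, -2.5, -2.5), "diamond"),
        ((1.0, 3.0, -2.0), "triangle"), ((1.0, 2.5, -2.5), "triangle")]
classes = ["square", "diamond", "triangle"]

w_ovo = ovo_train(data, classes)
predictions = [ovo_predict(w_ovo, x) for x, _ in data]
print(len(w_ovo), predictions)
```

With $K$ classes this trains $K(K-1)/2$ classifiers, each on only the examples of its two classes, which is the trade-off listed in the pros and cons above.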
Fun Time

Assume that some binary classification algorithm takes exactly $N^3$ CPU-seconds for data of size $N$. Also, for some 10-class multiclass classification problem, assume that there are $N/10$ examples for each class. Which of the following is the total CPU-seconds needed for OVO decomposition based on the binary classification algorithm?

1 $\frac{9}{200} N^3$
2 $\frac{9}{25} N^3$
3 $\frac{4}{5} N^3$
4 $N^3$

Reference Answer: 2

There are 45 binary classifiers, each trained with data of size $(2N)/10$. Note that OVA decomposition with the same algorithm would take $10 N^3$ time, much worse than OVO.
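The count in the reference answer can be reproduced with exact fractions. The sketch below assumes the cubic training cost ($N^3$ CPU-seconds for data of size $N$) and works with coefficients of $N^3$ only; the variable names are illustrative.

```python
from fractions import Fraction
from math import comb

K = 10                                   # number of classes
pairs = comb(K, 2)                       # number of pairwise (OVO) classifiers
pair_size = Fraction(2, K)               # each pair uses 2N/10 examples, as a fraction of N

# coefficients of N^3, assuming a cost of (data size)^3 per classifier
ovo_cost = pairs * pair_size ** 3
ova_cost = K * Fraction(1) ** 3          # OVA: K classifiers, each on all N examples

print(pairs, ovo_cost, ova_cost)
```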
Summary
1 When Can Machines Learn?
2 Why Can Machines Learn?
3 How Can Machines Learn?

Lecture 10: Logistic Regression
Lecture 11: Linear Models for Classification
• Linear Models for Binary Classification
  three models useful in different ways
• Stochastic Gradient Descent
  follow negative stochastic gradient
• Multiclass via Logistic Regression
  predict with maximum estimated $P(k|\mathbf{x})$
• Multiclass via Binary Classification
  predict the tournament champion

• next: from linear to nonlinear
4 How Can Machines Learn Better?